Background / Context:

Description of prior research and its intellectual context. 


Recent years have seen an increased interest in the use of randomized experiments to evaluate 
the causal effects of educational policies and practices in the United States (Angrist, 2004; 
Spybrook, Cullen and Lininger, 2011). Most large-scale randomized experiments in education
utilize the hierarchical structure of the U.S. educational system (students are nested in 
classrooms, classrooms are nested in schools, etc.) in the experimental design. Two basic types 
of designs are typically used. Cluster randomized or hierarchical trials (abbreviated CRT or HT) 
randomly assign entire clusters of students (such as schools) to a treatment group or a control 
group. Randomized block or multi-site trials (abbreviated RBD or MST) utilize clusters (such as 
schools) as blocks in the experimental design and randomly assign students within these blocks. 

Researchers seeking funding to conduct an HT or MST will generally need to complete a power
analysis in order to reassure the funder that the sample sizes (at the different levels of the design)
will be sufficient to yield reasonable statistical power. Recent years have seen the
development of software programs that facilitate the computation of statistical power when linear
mixed models are used to analyze the types of cluster randomized and multi-site experiments
described above. A recent article by Spybrook, Hedges and Borenstein (2014) describes two
such programs: Optimal Design Plus (OD Plus; Raudenbush et al., 2011) and CRT Power
(Borenstein, Hedges and Rothstein, 2012).

In order to compute power in either of the two programs listed above, the user must specify
values for certain crucial design parameters. For instance, power for a two-level HT in CRT
Power is computed after the user specifies the sample sizes at both levels of the hierarchy, the
effect size, and the intracluster correlation coefficient (ICC). Power for a two-level MST in CRT
Power is computed after the user specifies the sample sizes at both levels of the hierarchy, the
effect size, the ICC, and an effect size variance parameter.
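
To make the role of these inputs concrete, the following minimal Python sketch approximates power for a balanced two-level HT from the number of schools, the school size, the effect size and the ICC. It relies on a normal approximation to the test statistic and is not the algorithm implemented in CRT Power or OD Plus; the function name `ht_power` and the example values are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def ht_power(m, cluster_size, delta, icc, alpha=0.05):
    """Approximate two-sided power for a balanced two-level hierarchical trial.

    m            -- total number of schools (m / 2 assigned to each arm)
    cluster_size -- number of students per school
    delta        -- standardized effect size
    icc          -- intracluster correlation coefficient
    """
    z = NormalDist()
    # Variance of the standardized difference between the two arm means.
    var_diff = 4.0 * (icc + (1.0 - icc) / cluster_size) / m
    lam = delta / sqrt(var_diff)  # noncentrality parameter
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    # Probability of rejecting in either tail (normal approximation).
    return z.cdf(lam - z_crit) + z.cdf(-lam - z_crit)


if __name__ == "__main__":
    # Hypothetical planning values: 40 schools, 50 students per school.
    for icc in (0.10, 0.20):
        print(icc, round(ht_power(m=40, cluster_size=50, delta=0.25, icc=icc), 3))
```

Holding the other inputs fixed, a larger ICC lowers power in this approximation, which is why the choice of ICC matters so much at the planning stage.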

The need to specify an ICC and an effect size for the design of experiments in education has
generated research seeking to clarify likely values for these parameters. These studies utilize
survey-based data sources (such as state longitudinal databases) to compute ICCs (e.g., Hedges
and Hedberg, 2007; Westine, Spybrook and Taylor, 2013) or benchmark effect sizes (e.g., Bloom,
Hill, Rebeck-Black and Lipsey, 2008; Scammacca, Fall and Roberts, 2015). Typically the ICCs
and benchmark effect sizes are entered directly into the power analysis software by researchers
planning an MST or HT study.

There has been much less research on the topic of reasonable values for the effect size variance 
parameter. However, there is some evidence that researchers are beginning to pay more attention 
to obtaining empirical evidence about values of this parameter (Raudenbush, Reardon and Nomi, 
2012; Raudenbush and Bloom, 2015). 

Purpose / Objective / Research Question / Focus of Study: 

Description of the focus of the research. 
The current paper notes that estimates of ICCs and/or effect sizes that come from surveys may
need to be adjusted before being utilized to compute power for HTs or MSTs. Current software 
implementations prompt users to make assumptions about heterogeneity in treatment effects 
(effect size variance) only when performing a power analysis for MSTs. However, heterogeneity 
in treatment effects will impact the appropriate value of the ICC to use when planning a 
hierarchical trial. In particular, it is likely that ICCs computed from sample surveys will need to 
be adjusted to account for treatment effect heterogeneity when performing a power analysis for a 
hierarchical trial. 

Similarly, the denominator of benchmark effect size values computed from sample surveys is 
likely to represent the variance of the outcome variable in the control group. When there are 
heterogeneous treatment effects, the variance in the treatment group is unlikely to equal the 
variance in the control group. Hence, benchmark effect size values may also need to be adjusted 
when used in a power analysis. 

Significance / Novelty of study: 

Description of what is missing in previous work and the contribution the study makes. 


Previous work reporting ICCs and effect size benchmarks has failed to note the need for 
adjustments to these values in order to account for heterogeneity in treatment effects when 
planning a research study. Furthermore, writings about the computation of power in HT and 
MST designs typically define parameters solely in terms of “observed data” notation (eg. 
Raudenbush and Liu, 2000; Konstantopooulos, 2008). As a result the connection between these 
two sets of parameters has not been entirely clear. This has meant that a researcher comparing 
power for a multisite design with power for a hierarchical design could not be sure how 
assumptions about parameters made for one design should inform the assumptions made for the 
other design. For instance, should the ICC entered in to CRT Power when planning a two level 
HT be the same as the ICC which is entered when planning a two level MST? 

The current paper resolves this issue using a potential outcomes approach. Writing parameters in 
terms of potential outcomes also makes clear how ICCs and/or effect sizes from surveys need to 
be adjusted to result in a correct power analysis for an HT or an MST. The result is a set of
correction factors which researchers should use when planning future studies. 

Statistical, Measurement, or Econometric Model: 

Description of the proposed new methods or novel applications of existing methods. 


The paper assumes a design with m schools, each containing 2n students. The statistical models
considered are the usual linear mixed models used to define a power function for two-level
hierarchical and multisite trials. In particular, the model for the outcome score of the kth student
in the jth school in the ith treatment group in the hierarchical design is

$$Y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \varepsilon_{ijk}. \qquad (1)$$

The $\beta_{j(i)}$ parameters represent random effects associated with schools and the
$\varepsilon_{ijk}$ parameters represent student level random errors. The variance of the
$\beta_{j(i)}$ parameters is $\sigma^2_{B,H}$, with the total variance of the outcome measure in the
hierarchical design given by $\sigma^2_{T,H}$. Let $\mu_E$ be the average value of the outcome
variable when all students are assigned to the experimental condition and $\mu_C$ be the average
value of the outcome variable when all students are assigned to the control condition. Then the
effect size and ICC parameters necessary to correctly compute power in the hierarchical design are
defined as follows:

$$\delta_H = \frac{\mu_E - \mu_C}{\sigma_{T,H}} \quad \text{and} \quad \rho_H = \frac{\sigma^2_{B,H}}{\sigma^2_{T,H}}.$$

The model for the outcome score of the kth student in the jth school in the ith treatment group in
the multisite design is

$$Y_{ijk} = \mu + \alpha_i + \gamma_j + (\alpha\gamma)_{ij} + \varepsilon_{ijk}. \qquad (2)$$

The $\gamma_j$ parameters represent random variation in school means, the $(\alpha\gamma)_{ij}$ are
random effects representing variation in treatment effects across schools with variance
$\sigma^2_{\delta,M}/2$ and the $\varepsilon_{ijk}$ parameters represent student level random errors.
Using the parameter definitions in equations (3)-(5) for the effect size, ICC and effect size
variance will result in a correct power analysis for the multisite design. The symbol
$\sigma^2_{B,C}$ refers to the between school variance in the control group and the symbol
$\sigma^2_{W,C}$ refers to the within school variance in the control group.
[Equations (3)-(5): displays not recoverable from the source.]


We also write a standard two-level hierarchical model in terms of potential outcomes as follows.
The outcome that would be observed for the kth student in the jth school in the hypothetical world
where everyone in the experiment was assigned to the control condition is modelled as

$$Y^C_{jk} = \mu_C + \beta^C_j + \varepsilon^C_{jk}, \quad j = 1, \ldots, m; \; k = 1, \ldots, 2n.$$

On the other hand, in the hypothetical world where everyone in the experiment was assigned to the
experimental condition we would get the following model:

$$Y^E_{jk} = \mu_E + \beta^E_j + \varepsilon^E_{jk}, \quad j = 1, \ldots, m; \; k = 1, \ldots, 2n.$$

The $\beta^C_j$ ($\beta^E_j$) are mean zero random effects having variance $\sigma^2_{B,C}$
($\sigma^2_{B,E}$). The $\varepsilon^C_{jk}$ and $\varepsilon^E_{jk}$ are mean zero random effects
assumed to have the same variance, $\sigma^2_W$, in the experimental and control groups.
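
One consequence of writing both potential-outcome models side by side can be sketched as follows. The school-specific treatment effect $d_j$ and the symbols $\sigma^2_{\delta}$ and $\sigma_{\delta,cov}$ for its variance and its covariance with $\beta^C_j$ are notation assumed here for illustration. The last step further assumes that the between-school variance relevant to a balanced hierarchical trial is the average of the two arm-specific variances.

```latex
% Assumed notation: d_j is the school-specific deviation in treatment effect.
\begin{align*}
d_j &= \beta^E_j - \beta^C_j, \qquad
\operatorname{Var}(d_j) = \sigma^2_{\delta}, \qquad
\operatorname{Cov}(\beta^C_j, d_j) = \sigma_{\delta,cov} \\
\sigma^2_{B,E} &= \operatorname{Var}(\beta^C_j + d_j)
  = \sigma^2_{B,C} + \sigma^2_{\delta} + 2\,\sigma_{\delta,cov} \\
\tfrac{1}{2}\left(\sigma^2_{B,C} + \sigma^2_{B,E}\right)
  &= \tfrac{1}{2}\,\sigma^2_{\delta} + \sigma^2_{B,C} + \sigma_{\delta,cov}
\end{align*}
```

Under these assumptions the between-school variance differs across the two conditions whenever treatment effects vary across schools, even though the within-school variance is the same in both.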


Usefulness / Applicability of Method: 

Demonstration of the usefulness of the proposed methods using hypothetical or real data. 

The usefulness of the method is evident from the description in the Findings/Results section. 

Findings / Results: 

Description of the main findings with specific details. 
Using the models defined above one can show the following relationship between the parameters
in the hierarchical and multisite designs:

$$\sigma^2_{B,H} = \sigma^2_{\delta,M}/2 + \sigma^2_{B,C} + \sigma_{\delta,cov}.$$

The $\sigma_{\delta,cov}$ symbol represents the covariance between treatment effects and
school-specific means in the control group. Because it is useful to write results in terms of a
standardized version of this parameter we define

$$\tau_{\delta,cov} = \frac{\sigma_{\delta,cov}}{\sigma^2_{B,C} + \sigma^2_W}.$$

ICCs computed from surveys describe an ICC defined in terms of control group quantities, namely,

$$\rho_s = \frac{\sigma^2_{B,C}}{\sigma^2_{B,C} + \sigma^2_W}.$$

Similarly, benchmark effect sizes from surveys represent a mean difference standardized by the
total standard deviation in the control group, namely,

$$\delta_{bench} = \frac{\mu_E - \mu_C}{\sigma_{T,C}}.$$

In the interest of space results are presented only for the case where one is planning a
hierarchical trial using a survey-based ICC and a benchmark effect size. In this case the ICC can
be multiplied by a single correction factor that will simultaneously adjust for the impact of
heterogeneity on the ICC and the impact of heterogeneity on the effect size to result in an
accurate power analysis. This factor can be expressed as

$$\kappa_h = 1 + \frac{n}{2n-1} \cdot \frac{\tau^2_{\delta} + 2\tau_{\delta,cov}}{\rho_s}.$$

Clearly the correction factor depends on the covariance between school-specific treatment effects
and school-specific control group means. The possible values for this parameter can be bounded
using the covariance inequality. Table 1 in the appendix tabulates values of $\kappa_h$ for various
values of $\tau^2_{\delta}$, $\rho_s$ and $\tau_{\delta,cov}$. The notation UB=Upper Bound and
LB=Lower Bound is used.
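
The mapping from control-group (survey) quantities to the parameters of the hierarchical design can be sketched in Python as follows. This is a minimal illustration, not the paper's code: it assumes the total variance in the hierarchical design is the sum of the between-school and within-school variances, and the function name, argument names and example values are hypothetical.

```python
from math import sqrt


def hierarchical_parameters(var_b_c, var_w, var_delta, cov, mean_diff):
    """Map control-group quantities to hierarchical-trial parameters.

    var_b_c   -- between-school variance in the control group
    var_w     -- within-school variance (same in both conditions)
    var_delta -- variance of school-specific treatment effects
    cov       -- covariance of treatment effects with control-group school means
    mean_diff -- raw mean difference between conditions
    """
    # Covariance inequality: |cov| cannot exceed sqrt(var_b_c * var_delta).
    bound = sqrt(var_b_c * var_delta)
    if abs(cov) > bound + 1e-12:
        raise ValueError("covariance violates the covariance inequality")
    var_t_c = var_b_c + var_w                   # total variance, control group
    var_b_h = var_delta / 2.0 + var_b_c + cov   # between-school variance, hierarchical design
    var_t_h = var_b_h + var_w                   # assumed total variance, hierarchical design
    return {
        "icc_survey": var_b_c / var_t_c,
        "icc_hier": var_b_h / var_t_h,
        "delta_bench": mean_diff / sqrt(var_t_c),
        "delta_hier": mean_diff / sqrt(var_t_h),
        "cov_bounds": (-bound, bound),
    }


if __name__ == "__main__":
    # Hypothetical values evaluated at the lower and upper covariance bounds.
    for cov in (-0.1, 0.1):
        print(cov, hierarchical_parameters(0.2, 0.8, 0.05, cov, 0.25))
```

With no treatment effect heterogeneity (`var_delta = 0`, `cov = 0`) the survey-based and hierarchical values coincide; otherwise they diverge in a direction set by the sign and size of the covariance.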


Conclusions: 

Description of conclusions, recommendations, and limitations based on findings. 


Current practice for conducting power analyses in hierarchical trials using survey-based ICC and
effect size estimates may be misestimating power because ICCs are not being adjusted to account
for treatment effect heterogeneity. Results presented in Table 1 show that the necessary
adjustments can be quite large or quite small. Furthermore, power estimates computed without
adjusting the ICC could be either too large or too small, depending on the covariance of
school-specific treatment effects with control group school means. The paper illustrates the need
for the field to
obtain better empirical evidence about likely values of this parameter in order to conduct 
accurate power analyses. 
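
The direction of the misestimation can be illustrated with a short Python sketch that compares unadjusted power with power at the lower and upper bounds of the standardized covariance. It is a sketch under stated assumptions: a normal approximation to the test statistic, a between-school variance in the hierarchical design equal to the control-group value plus half the treatment effect variance plus the covariance, and all variances standardized by the total control-group variance. The function name and the planning values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_from_survey_values(m, cluster_size, delta_bench, icc_survey,
                             tau2=0.0, tau_cov=0.0, alpha=0.05):
    """Approximate power of a balanced hierarchical trial from survey inputs.

    tau2    -- treatment effect variance / total control-group variance
    tau_cov -- covariance of treatment effects with control-group school
               means / total control-group variance
    """
    # Standardized form of the covariance inequality.
    if tau_cov ** 2 > icc_survey * tau2 + 1e-12:
        raise ValueError("tau_cov violates the covariance inequality")
    z = NormalDist()
    # Standardized between-school variance in the hierarchical design.
    between = icc_survey + tau2 / 2.0 + tau_cov
    var_diff = 4.0 * (between + (1.0 - icc_survey) / cluster_size) / m
    lam = delta_bench / sqrt(var_diff)
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    return z.cdf(lam - z_crit) + z.cdf(-lam - z_crit)


if __name__ == "__main__":
    # Hypothetical planning values: 40 schools of 50 students (n = 25).
    args = dict(m=40, cluster_size=50, delta_bench=0.25, icc_survey=0.15)
    bound = sqrt(args["icc_survey"] * 0.02)
    print("unadjusted:", power_from_survey_values(**args))
    print("lower bound:", power_from_survey_values(**args, tau2=0.02, tau_cov=-bound))
    print("upper bound:", power_from_survey_values(**args, tau2=0.02, tau_cov=bound))
```

In this example the unadjusted value falls between the two bounded values, so ignoring heterogeneity can either overstate or understate power.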
Appendices

Not included in page count. 
Appendix B. Tables and Figures

Table 1: Correction factor ($\kappa_h$) when planning a hierarchical study with a survey-based ICC and
a benchmark effect size (n=25).